arXiv:hep-ex/0112003v2 5 Dec 2002 




Redundant Arrays of IDE Drives 

D. A. Sanders, Member, IEEE, L. M. Cremaldi, Member, IEEE, V. Eschenburg, C. N. Lawrence, C. Riley, Member, IEEE, D. J. Summers, D. L. Petravick


Abstract — The next generation of high-energy physics 
experiments is expected to gather prodigious amounts of 
data. New methods must be developed to handle this data 
and make analysis at universities possible. We examine 
some techniques that use recent developments in commodity
hardware. We test redundant arrays of integrated drive
electronics (IDE) disk drives for use in offline high-energy 
physics data analysis. IDE redundant array of inexpensive
disks (RAID) prices now equal the cost per terabyte
of million-dollar tape robots! The arrays can be scaled to 
sizes affordable to institutions without robots and used when 
fast random access at low cost is important. We also explore 
three methods of moving data between sites: internet transfers,
hot pluggable IDE disks in FireWire cases, and writable
digital video disks (DVD-R). 

Keywords —RAID, EIDE, FireWire. 


I. Introduction 

We report tests, using the Linux operating system,
of redundant arrays of integrated drive electronics 
(IDE) disk drives for use in particle physics Monte Carlo 
simulations and data analysis. Parts costs of total
systems using commodity IDE disks are now at the
$4000 per terabyte level. A revolution is in the making. 
Disk storage prices have now decreased to the point where 
they equal the cost per terabyte of 300 terabyte Storage 
Technology tape silos. The disks, however, offer far better 
granularity; even small institutions can afford to deploy 
systems. The faster random access of disk versus tape is 
another major advantage. Our tests include reports on 
software redundant arrays of inexpensive disks - Level 5 
(RAID-5) systems running under Linux 2.4 using Promise 
Ultra 100 disk controllers. RAID-5 protects data in case of 
a catastrophic single disk failure by providing parity bits. 
Journaling file systems are used to allow rapid recovery 
from system crashes. We also report on using FireWire 
(IEEE 1394) to PCI (Peripheral Component Interconnect) 
interfaces. FireWire PCI cards allow sixty-three devices 
(e.g. a combination of computers and disks) per card. The 
maximum FireWire bus speed is currently limited to 400
megabits per second. FireWire is also hot pluggable. 

Our data analysis strategy is to encapsulate data and 
CPU processing power together. Data is stored on many
PCs. Analysis of a particular part of a data set takes place 
locally on, or close to, the PC where the data resides. The 
network backbone is only used to put results together. If 
the I/O overhead is moderate and analysis tasks need more 
than one local CPU to plow through data, then each of 
these disk arrays could be used as a local file server to a 
few computers sharing a local ethernet switch. These commodity
8-port gigabit ethernet switches would be combined
with a single high end, fast backplane switch allowing the 
connection of a thousand PCs. We have also successfully 
tested using Network File System (NFS) software to connect
our disk arrays to computers that cannot run Linux
2.4. 

We examine three ways of moving data between sites: internet
transfers, hot pluggable IDE disks in FireWire cases,
and writable digital video disks (DVD-R). Writable 4.7 GB 
DVD-R disks are now available for $5. They can be read 
by $60 DVD-ROM drives and written by the $500 Pioneer 
DVR-A03 drive.

RAID stands for Redundant Array of Inexpensive
Disks. Many industry offerings meet all of the qualifications
except the inexpensive part, severely limiting the size
of an array for a given budget. This may change. The
different RAID levels can be defined as follows:

• RAID-0: “Striped.” Disks are combined into one physical
device where reads and writes of data are done in parallel.
Access speed is fast but there is no redundancy.

• RAID-1: “Mirrored.” Fully redundant, but the size is
limited to the smallest disk.

• RAID-4: “Parity.” For N disks, one disk holds the parity
information and the remaining N - 1 disks are combined. Protects
against a single disk failure but access speed is slow since
you have to update the parity disk for each write.

• RAID-5: “Striped-Parity.” As with RAID-4, the effective
size is that of N - 1 disks. However, since the parity
information is also distributed evenly among the N drives
the bottleneck of having to update the parity disk for each
write is avoided. Protects against a single disk failure and
the access speed is fast.
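
The parity idea behind RAID-4 and RAID-5 can be illustrated with a few lines of Python. This is only a sketch of the principle, not the Linux software RAID implementation: the function names are hypothetical, each "disk" is a short byte string, and the rotation of parity blocks across drives that distinguishes RAID-5 from RAID-4 is not modeled.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return the N - 1 data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe, failed_index):
    """Recompute the block of one failed disk from the surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)

# Four "disks": three hold data, one holds parity.
stripe = make_stripe([b"EVT1", b"EVT2", b"EVT3"])
assert rebuild(stripe, 1) == b"EVT2"      # a lost data block is recovered
assert rebuild(stripe, 3) == stripe[3]    # a lost parity block is recovered
```

Because XOR is its own inverse, any single missing block equals the XOR of all the others, which is why exactly one disk failure can be tolerated.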

RAID-5, using enhanced integrated drive electronics 
(EIDE) disks under Linux software, is now available.
Redundant disk arrays do provide protection in the most 
likely single disk failure case, that in which a single disk 
simply stops working. This removes a major obstacle to 
building large arrays of EIDE disks. However, RAID-5 
does not totally protect against other types of disk failures. 
RAID-5 offers limited protection in the case where a single
disk stops working and causes the whole EIDE bus (or the
whole EIDE controller card) to fail, but only temporarily
stops them from functioning. This would temporarily
disable the whole RAID-5 array. If replacing the
bad disk solves the problem, i.e. the failure did not permanently
damage data on other disks, then the RAID-5
array would recover normally. Similarly, if only the controller
card was damaged then replacing it would allow the
RAID-5 array to recover normally. However, if more than
one disk was damaged, especially if the file or directory
structure information was damaged, the entire RAID-5 array
would be damaged. The remaining failure mode would
be for a disk to be delivering corrupted data. There is no
protection for this inherent to RAID-5; however, a longitudinal
parity check on the data, such as a cyclic redundancy
check (CRC), could be built into event headers to flag the
problem. Redundant copies of data that are very hard to
recreate are still needed. RAID-5 does allow one to ignore
backing up data that is only moderately hard to recreate.
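
A check value in an event header might look like the following sketch. The header layout (payload length followed by a CRC-32) and the function names are assumptions made for illustration; they are not the format of any experiment's event records. Only the standard `struct` and `zlib` modules are used.

```python
import struct
import zlib

# Hypothetical header: 4-byte payload length, 4-byte CRC-32 (little endian).
HEADER = struct.Struct("<II")

def pack_event(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC-32."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def unpack_event(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if it fails the CRC check."""
    length, stored_crc = HEADER.unpack_from(record)
    payload = record[HEADER.size:HEADER.size + length]
    if len(payload) != length or zlib.crc32(payload) != stored_crc:
        raise ValueError("event payload failed its CRC check")
    return payload

record = pack_event(b"hits: 42 tracks: 7")
assert unpack_event(record) == b"hits: 42 tracks: 7"

# Flip one bit in the payload, as a disk returning corrupted data might.
damaged = record[:-1] + bytes([record[-1] ^ 0x01])
try:
    unpack_event(damaged)
except ValueError as err:
    print("corruption flagged:", err)
```

Such a check only flags the damage; recovering the event still requires a redundant copy.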

II. Large Disks 

In today’s marketplace, the cost per terabyte of disks 
with EIDE interfaces is about a third that of disks with
SCSI (Small Computer System Interface), as illustrated in
Fig. 1. The EIDE interface is limited to 2 drives on each
bus and SCSI is limited to 7 (14 with wide SCSI). The only 
major drawback of EIDE disks is the limit in the length of 
cable connecting the drives to the drive controller. This 
limit is nominally 18 inches; however, we have successfully 
used 24 inch long cables. Therefore, one is limited to 10
disks per box for an array (or perhaps 20 with a “double 
tower”). To get a large RAID array one needs to use large 
capacity disk drives. There have been some problems with 
using large disks, primarily the maximum addressable size. 
We have addressed these problems in an earlier paper.
Because of these concerns and because we wanted to put 
more drives into an array than could be supported by the 
motherboard we opted to use PCI disk controller cards. 
We tested both Promise Technologies ULTRA 66 and
ULTRA 100 disk controller cards, which each support 
four drives. 

Using arrays of disk drives, as shown in Table II, the cost
per terabyte is similar to that of Storage Technology
tape silos. However, RAID-5 arrays offer much better
granularity since they are scalable down to a terabyte. For
example, if you wanted to store 10 TB of data you would
still have to pay about $1,000,000 for the tape silo but only
$40,000 for a RAID-5 array. Thus, even small institutions
can afford to deploy systems. Therefore, as seen in Fig. 1,
“you can have your cake and eat it too”.

III. RAID Arrays 

There exist disk controllers that implement RAID-5 protocols
right in the controller, for example 3ware’s Escalade
7850, which will handle up to eight EIDE drives.
These controllers cost $600 and did not support disk drives
larger than 137 Gigabytes, so we focused our attention
on software RAID-5 implementations, which we
tested extensively.


Fig. 1. Historically the speed and cost of data storage has
increased as one moved from tape to disk to RAM. EIDE
RAID-5 disk arrays add another layer to the data storage
cake. One doesn’t have to worry as much about tape backup
except for data that is very hard to recreate. The chance
of losing data is lower than with plain scratch disks. The
cost of EIDE RAID-5 is close to that of tape robots and
the random access speed of disk is much faster.


A. Hardware 

We have examined both Maxtor DiamondMax
and IBM DeskStar hard disks. For RAID-5 the 
disk partitions must be all of the same size. The only 
trouble we had was when Maxtor changed the capacity 
for the 80 GB disk from 81.9 GB to 80 GB. We had to 
repartition the 81.9 GB disks to 80 GB (plus a wasted 
partition of 1.9 GB). Fortunately this happened to a test 
array and not while trying to replace a failed disk in a 
working RAID-5 array. Disk manufacturers have recently 
decided to define one GB as 1000 MB, rather than 1024 
MB. The drives we consider for use with a RAID-5 array are 
compared in Table I. In general, the internal I/O speed of
a disk is proportional to its rotational speed and increases 
as a function of platter capacity. 

When assembling an array we had to worry about the
“spin-up” current draw on the 12V part of the power supply.
With 8 disks in the array (plus the system disk) we
would have exceeded the capacity of the power supply that
came with our tower case, so we decided to add a second
off-the-shelf power supply rather than buying a more expensive
single supply. By using 2 power supplies we benefit
from under loading the supplies. The benefits include both
a longer lifetime and better cooling since the heat generated
is distributed over 2 supplies, each with its own cooling
fans. We used the hardware shown in Table II for our array
test. Many of the components we chose are generic; many
components from other manufacturers also work. We have
measured the wall power consumption for the whole disk
array box in Table II. It uses 276 watts at startup and 156
watts during normal sustained running.

TABLE I

Comparison of Large EIDE Disks for a RAID-5 Array

Disk Model          GB    RPM    Cost per GB   GB per platter   Spin-Up Current at 12V
Maxtor D540X        80    5400   $2.11         20               2.00A
Maxtor D536X [16]   100   5400   $2.27         33               0.64A
Maxtor D540X [15]   160   5400   $1.85         40               1.80A
IBM 75GXP           75    7200   $3.19         15               2.00A
IBM 120GXP          120   7200   $2.91         40               2.00A

TABLE II

700 GB RAID-5 Configuration

System Unit
Component                                    Price (each)
100GB Maxtor system disk [16]                $227
8 - 100GB Maxtor RAID-5 disks                $227
2 - Promise ATA/100 PCI cards                $27
4 - StarTech 24” ATA/100 cables              $3
AMD Athlon 1.4 GHz/266 CPU                   $120
Asus A7A266 motherboard, audio               $132
2 - 256MB DDR PC2100 DIMMs                   $35
In-Win Q500P Full Tower Case                 $77
Sparkle 15A @ 12V power supply               $34
2 - Antec 80mm ball bearing case fans        $8
110 Alert temperature alarm                  $15
Pine 8MB AGP video card                      $20
SMC EZ card 10/100 ethernet                  $12
Toshiba 16x DVD, 48x CDROM                   $54
Sony 1.44 MB floppy drive                    $12
Key Tronic 104 key PS/2 keyboard             $7
DEXXA 3 button PS/2 mouse                    $4
Total                                        $2682


To install the second power supply we had to modify our 
tower case with a jigsaw and a hand drill. We also had to 
use a jumper to ground the green wire in the 20-pin block 
ATXPWR connector to fake the power-on switch. 

When installing the two disk controller cards care had 
to be taken that they did not share interrupts with other 
highly utilized hardware such as the video card and the 
ethernet card. We also tried to make sure that they did 
not share interrupts with each other. There are 16 possible 
interrupt requests (IRQs) that allow the various devices, 
such as EIDE controllers, video cards, mice, serial, and 
parallel ports, to communicate with the CPU. Most PC
operating systems allow sharing of IRQs but one would
naturally want to avoid overburdening any one IRQ. There
is also a special class of IRQs used by the PCI bus, called
PCI IRQs (PIRQs). Each PCI card slot has 4
interrupt numbers. This means that they share some IRQs
with the other slots; therefore, we had to juggle the cards
we used (video, 2 EIDE controllers, and an ethernet card).

When we tried to use a disk as a “Slave” on a motherboard
EIDE bus, we found that it would not run at the
full speed of the bus and slowed down the access speed of 
the entire RAID-5 array. This was a problem of either the 
motherboard’s basic input/output system (BIOS) or EIDE 
controller. This problem was not in evidence when using 
the disk controller cards. Therefore, we decided that rather 
than take a factor of 10 hit in the access speed we would 
rather use 8 instead of 9 hard disks. 


B. Software 


For the actual tests we used Linux kernel 2.4.5 with the
RedHat 7 (see http://www.redhat.com/) distribution (we
had to upgrade the kernel to this level). The latest stable
kernel version is 2.4.18 (see http://www.lwn.net/). We
needed the 2.4.x kernel to allow full support for “Journaling”
file systems. Journaling file systems provide rapid recovery
from crashes. A computer can finish its boot-up at a
normal speed, rather than waiting to perform a file system
check (FSCK) on the entire RAID array. This is then
conducted in the background, allowing the user to continue to
use the RAID array. There are now 4 different Journaling
file systems: XFS, a port from SGI; JFS, a port from
IBM; ext3, a Journalized version of the standard
ext2 file system; and ReiserFS from namesys. Comparisons
of these Journaling file systems have been done
elsewhere. When we tested our RAID-5 arrays only
ext3 and ReiserFS were easily available for the 2.4.x
kernel; therefore, we tested 2 different Journaling file systems:
ReiserFS and ext3. We opted to use ext3 for two
reasons: 1) at the time there were stability problems with
ReiserFS and NFS (this has since been resolved with kernel
2.4.7) and 2) it was an extension of the standard ext2fs (it
was originally developed for the 2.2 kernel) and, if synced
properly, could be mounted as ext2. Ext3 is the only one
that will allow direct upgrading from ext2; this is why it is
now the default for RedHat 7.2.

NFS is a very flexible system that allows one to manage 
files on several computers inside a network as if they were 
on the local hard disk. So, there’s no need to know what 
actual file system they are stored under nor where the files 
are physically located in order to access them. Therefore, 
we use NFS to connect these disk arrays to computers that
cannot run Linux 2.4. We have successfully used NFS to
mount this disk array on the following types of computers:
a DECstation 5000/150 running Ultrix 4.3A, a Sun
UltraSparc 10 running Solaris 7, a Macintosh G3 running 
MacOSX, and various Linux boxes with both the 2.2 and 
2.4 kernels. We are currently using two of these RAID-5 
boxes to run analysis software with the BaBar KANGA 
code and the CMS CMSIM/ORCA code. 
We have performed a few simple speed tests. The first
was “hdparm -tT /dev/xxx”. This test simply reads a 64 
MB chunk of data and measures the speed. On a single 
drive we saw read/write speeds of about 30 MB/s. On the 
whole array we saw a drop to 28 MB/s. When we tried writing
a text file using a simple FORTRAN program (we wrote
“All work and no play make Jack a dull boy” 10® times), 
the speed was 22.34 MB/s. While mounted via NFS over
100 Mb/s ethernet the speed was 2.12 MB/s, limited by 
both the ethernet speed and communication overhead. In 
the past, we have been able to get a much higher fraction
of the rated ethernet bandwidth by using the lower level
TCP/IP socket protocol in place of the higher level
NFS protocol. TCP/IP sockets are more cumbersome to 
program, but are much faster. 

We also tested what actually happens when a disk fails 
by turning the power off to one disk in our RAID-5 array. 
One could continue to read and write files, but in a “degraded”
mode, that is without the parity safety net. When
a blank disk was added to replace the failed disk, again one
could continue to read and write files in a mode where the
disk access speed is reduced while the system rebuilt the 
missing disk as a background job. This speed reduction in 
disk access was due to the fact that the parity regeneration 
is a major disk access in its own right. For more details, 
see reference Q. 

The performance of Linux IDE software drivers is improving.
The latest standards include support for
command overlap, READ/WRITE direct memory access
QUEUED commands, scatter/gather data transfers without
intervention of the CPU, and elevator seeks. Command
overlap is a protocol that allows devices that require
extended command time to perform a bus release so that 
commands may be executed by the other device on the bus. 
Command queuing allows the host to issue concurrent com¬ 
mands to the same device. Elevator seeks minimize disk 
head movement by optimizing the order of I/O commands. 

We did encounter a few problems. We had to modify 
“MAKEDEV” to allow for more than eight IDE devices, 
that is to allow for disks beyond “/dev/hdg”. For version
2.x one would have to actually modify the script;
however, for version 3.x we just had to modify the file 
“/etc/makedev.d/ide”. 

Another problem was the 2 GB file size limit. Older operating
system and compiler libraries used a 32 bit “long-integer”
for addressing files; therefore, they could not normally
address files larger than 2 GB (2^31 bytes). There are
patches to the Linux 2.4 kernel and glibc but there are
still some problems with NFS and not all applications use
these patches.

We have found that the current underlying file systems
(ext2, ext3, reiserfs) do not have a 2 GB file size limit.
The limit for ext2/ext3 is in the petabytes. The 2.4 kernel
series supports large files (64-bit offsets). Current versions
of GNU libc support large files. However, by default the
32-bit offset interface is used. To use 64-bit offsets, C/C++
code must be recompiled with the following as the first line:

#define _FILE_OFFSET_BITS 64

or the code must use the *64 functions (i.e. open becomes
open64, etc.) if they exist. This functionality is not included
in GNU FORTRAN (g77); however, it should be
possible to write a simple wrapper C program to replace
the OPEN statement (perhaps called open64). We have
succeeded in writing files larger than 2 GB using a simple
C program with “#define _FILE_OFFSET_BITS 64”
as the first line. This works over NFS version 3 but not
version 2.

Footnote: Since we originally submitted this paper we have tested a new
Asus motherboard (the A7M266 with the AMD 761 North Bridge
chip) and got significant increases in speed for the RAID-5 array.
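
The origin of the 2 GB figure is simple arithmetic on the width of a signed file offset. The short Python sketch below (the function name is hypothetical) computes the largest offset for 32-bit and 64-bit signed integers; it illustrates the limit only and does not exercise the C library interfaces discussed above.

```python
GIB = 2 ** 30

def max_file_offset(offset_bits: int) -> int:
    """Largest byte offset representable in a signed integer of this width."""
    return 2 ** (offset_bits - 1) - 1

# 32-bit signed offsets stop just short of 2 GiB.
print(max_file_offset(32), "bytes, or", max_file_offset(32) / GIB, "GiB")

# 64-bit signed offsets are far beyond any array discussed here.
print(max_file_offset(64) // GIB, "GiB")
```

One sign bit is lost because offsets must also express negative seeks, which is why a 32-bit offset reaches 2^31 rather than 2^32 bytes.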

While RAID-5 is recoverable for a hardware failure, there
is no protection against accidental deletion of files. To address
this problem we suggest a simple script to replace the
“rm” command. Rather than deleting files it would move
them to a “/raid/Trash” or better yet a “/raid/.Trash” directory
on the RAID-5 disk array (similar to the “Trash
can” in the Macintosh OS). The system administrator
could later purge them as space is needed using an algorithm
based on criteria such as file size, file age, and user
quota.
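
One possible shape for such a replacement is sketched below in Python. It is an illustration under stated assumptions rather than a tested administration tool: the function name and the timestamp suffix are hypothetical choices, and the purge policy is left out entirely.

```python
import shutil
import time
from pathlib import Path

def move_to_trash(path, trash_dir="/raid/.Trash"):
    """Move a file into the trash directory instead of deleting it."""
    source = Path(path)
    trash = Path(trash_dir)
    trash.mkdir(parents=True, exist_ok=True)
    # A timestamp suffix keeps same-named files from overwriting each other.
    target = trash / f"{source.name}.{time.time_ns()}"
    shutil.move(str(source), str(target))
    return target
```

Keeping the trash directory on the same array means the move is a rename rather than a copy, so it costs no extra disk space until the administrator decides what to purge.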


IV. FireWire 


FireWire was developed by Apple and is an IEEE standard
(IEEE 1394) defining a high speed serial bus. This bus
is also named “i.Link” by Sony. It is referred to as IEEE
1394 or just 1394 in the Linux world. It is a serial
bus similar in principle to the Universal Serial Bus (USB),
but runs at speeds of up to 400 Mb/s and is intended to
replace the SCSI bus; however, it is not centered around a
PC (i.e. there may be none or multiple PCs on the same
bus). The FireWire bus allows up to sixty-three devices per
chain. Also, because it has a mode of transmission which
guarantees bandwidth, it is used for digital video cameras
and similar devices. In general it is hot swappable.

There are 2 main chipsets supported under Linux. 
The supported chipsets are Texas Instruments PCILynx/PCILynx2
and OHCI compliant chips (produced by
various companies). FireWire drivers are now included in 
RedHat and other distributions and are supported in the 
2.4.x kernel (with patches for the 2.2.x kernel). However, 
not all drivers are included in a standard installation nor is 
it a default option when upgrading the kernel. The driver 
for storage devices, such as hard disks (SBP-2) , was not 
included in kernels until the 2.4.7 kernel. For these reasons, 
we are including the basic instructions here. 

We got FireWire working on a Linux box by
the following steps:

1. We used an inexpensive PCI FireWire controller, for a
cost of $25. It was an OHCI-1394 card with a VIA controller.

2. The kernel used was Linux 2.4.12 as released by Linus
Torvalds with Alan Cox’s -ac3 patch. Alan’s patches can
be downloaded at
http://www.bz2.us.kernel.org/pub/linux/kernel/people/alan/linux-2.4/.
The -ac series is basically
what Red Hat and other distributions base their kernels
on, and includes drivers not in stock 2.4.12.

3. We had to enable FireWire support when configuring
the kernel. This involved turning on the following:

IEEE 1394 (FireWire) support (EXPERIMENTAL)
OHCI-1394 support
SBP-2 support (Harddisks etc.)

(The RAWIO driver is not necessary for storage devices. 
In addition, you will need the SCSI disk driver enabled in 
the kernel, even if you don’t have a real SCSI interface on 
the machine. This is because FireWire is treated as a SCSI 
channel.) 

4. After rebooting with the new kernel, some recent distributions
should detect the FireWire card and install the
correct drivers. If not, the following modules need to be
manually loaded, in this order:

ohci1394
sbp2

The sbp2 driver is somewhat finicky; it helps to have a few
seconds delay between the two modprobes. The command
“cat /proc/scsi/scsi” should list the attached storage devices
(disks, CD-ROMs, etc.):

Attached devices:
Host: scsi1 Channel: 00 Id: 00 Lun: 00
  Vendor: Maxtor  Model: 1394 storage  Rev: 60
  Type:   Direct-Access  ANSI SCSI revision: 02

Some of the output may not make sense if an IDE-FireWire 
(1394) bridge is in use; we noticed the non-Maxtor drive 
had strange output. 


At the moment, the devices are added in more-or-less 
random order. The only way to guarantee ordering is to 
manually hot-plug them. We don’t know if this is a software
limitation or an artifact of the plug&play nature of
FireWire (there’s no permanent ID setting like IDE or SCSI 
have). Presumably if one writes a volume header label 
(e.g. with tune2fs -L) to each disk you could get around 
this problem. 


Hot plugging seems to work with the following caveat. 
Do not unplug a FireWire device without unmounting it 
first. While you do not have to shut down the computer
to remove the device you do have to unmount it. Once 
unmounted, disconnect the device physically and then run 
“rescan-scsi-bus.sh -r”. For new devices, plug them in and 
run “rescan-scsi-bus.sh”. The script can be downloaded at 
http://www.garloff.de/kurt/linux/rescan-scsi-bus.sh


After formatting the disks using ext2 (any common file
system, such as ext3 or ReiserFS, would work), we
successfully configured two FireWire disks as a RAID-5
array. One of the disks used the new Oxford 911 FireWire
to EIDE interface chip. We have succeeded
in writing a DVD-R using the Pioneer DVR-A03
over FireWire.


V. High Energy Physics Strategy 
A. Data Storage Strategy - Event Persistence 

We encapsulate data and CPU processing power. A
block of real or Monte Carlo simulated data for an analysis
is broken up into groups of events and distributed once
to a set of RAID disk boxes, which each may also serve a
few additional processors via a local 8-port gigabit ethernet
switch. Dual processor boxes would also add more local
CPU power. Events are kept physically contiguous on disks
to minimize I/O. Events are only built once. Event parallel
processing has a long history of success in high energy
physics. The data from each analysis
are distributed among all the RAID arrays so all the
computing power can be brought to bear on each analysis.
For example, in the case of an important analysis (such
as a Higgs analysis), one could put 50 GB of data onto
each of 100 RAID arrays and then bring the full computing
power of 700 CPUs into play. Instances of an analysis
job are run on each local cluster in parallel. Several analysis
jobs may be running in memory or queued to each
local cluster to level loads. The data volume of the results
(e.g. histograms) is small and is gathered together over the
network backbone. Results are examined and the analysis
is rerun. The system is inherently fault tolerant. If three
of a hundred clusters are down, one still gets 97% of the
data and analysis is not impeded.
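
The event-parallel pattern can be mimicked in a few lines of Python. This toy sketch assumes integer "events", round-robin placement, and a fixed bin width, none of which come from a real analysis framework; the function names are hypothetical. It shows that only small histograms need to cross the backbone and that losing three of a hundred arrays still leaves 97% of the events.

```python
from collections import Counter

def distribute(events, n_arrays):
    """Assign events round-robin to n_arrays local disk arrays."""
    arrays = [[] for _ in range(n_arrays)]
    for i, event in enumerate(events):
        arrays[i % n_arrays].append(event)
    return arrays

def local_histogram(events, bin_width=10):
    """Runs where the data lives; only this small result is shipped out."""
    return Counter((e // bin_width) * bin_width for e in events)

def merge(histograms):
    """Combine the per-array histograms gathered over the backbone."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total

events = list(range(1000))          # stand-in for one quantity per event
arrays = distribute(events, 100)
down = {3, 41, 77}                  # three clusters are unavailable
alive = [a for i, a in enumerate(arrays) if i not in down]
result = merge(local_histogram(a) for a in alive)
print(sum(result.values()) / len(events))   # prints 0.97
```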

RAID-5 arrays should be treated as fairly secure, large, 
high-speed “scratch disks”. RAID-5 just means that disk 
data will be lost less frequently. Data which is very hard 
to re-create still needs to reside on tape. The inefficiency 
of an offline tape vault can be an advantage. It’s harder to
erase your entire raw data set with a single keystroke, if
thousands of tapes have to be physically mounted. Someone
may ask why all the write protect switches are being
reset before all is lost. It’s the same reason the Air Force
has real people with keys in ICBM silos. 

The granularity offered by RAID-5 arrays allows a university
or small experiment in a laboratory to set up a
few terabyte computer farm, while allowing a large Analysis
Site or Laboratory to set up a few hundred terabyte
or a petabyte computer system. A large site
would not necessarily have to purchase the full system at
once, but could buy and install the system in smaller parts. This
would have two advantages: first, the site would be able
to spread the cost over a few years and second, given the
rapid increase in both CPU power and disk size, it could
get the best “bang for the buck”.

What would be required to build a 300 terabyte system
(the same size as a tape silo)? Start with eight 160 GB
Maxtor disks in a box. The Promise Ultra133 card allows
one to exceed the 137 GB limit. Each box provides 7 x
160 GB = 1120 GB of usable RAID-5 disk space in addition
to a CPU for computations. 300 terabytes is reached
with 270 boxes. Use 40 commodity 8-port gigabit ethernet
switches ($800 each) to connect the 270 boxes to a
40-port, high end, fast backplane ethernet switch.
This could easily fit in a room that was formerly occupied
by a few old Mainframes, say an area of about a hundred
square meters. The power consumption would be 42 kilowatts.
One would need to build up operational experience
for smooth running. As newer disks arrive that hold yet
more data, even a petabyte system would become feasible.

Footnote 2:
D-Link DGS-1008T 8-port gigabit ethernet switch    $765
Linksys EG0008 8-port gigabit ethernet switch      $727
Netgear GS508T 8-port gigabit ethernet switch      $770
Netgear GS524T 24-port gigabit ethernet switch     $1860
(See http://www.dlink.com/, http://www.linksys.com/products/,
and http://www.netgear.com/)
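
The sizing of the 300 terabyte system can be checked with a short calculation. The sketch below makes two assumptions that are not stated in the text: each 8-port switch connects seven boxes and uses its eighth port as the uplink, and a box with 160 GB disks draws the same 156 watts measured for the 700 GB test box.

```python
import math

DISKS_PER_BOX = 8
DISK_GB = 160
TARGET_GB = 300_000          # 300 terabytes
WATTS_PER_BOX = 156          # sustained wall power of the 700 GB test box
BOXES_PER_SWITCH = 7         # assumption: the eighth port is the uplink

usable_gb = (DISKS_PER_BOX - 1) * DISK_GB        # RAID-5 stores N - 1 disks of data
boxes_min = math.ceil(TARGET_GB / usable_gb)     # smallest count reaching the target
boxes = 270                                      # rounded figure used in the text
switches = math.ceil(boxes / BOXES_PER_SWITCH)
power_kw = boxes * WATTS_PER_BOX / 1000

print(usable_gb, boxes_min, switches, power_kw)
```

This gives 1120 GB per box, a minimum of 268 boxes, 39 switches, and about 42.1 kW, consistent with the rounded figures of 270 boxes, 40 switches, and 42 kilowatts quoted above.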

B. Data Transfer Strategy 

For small amounts of data and to update analysis software
one can use internet file transfers, preferably via
“rsync”. The program “rsync” remotely copies files and
uses a remote-update protocol to greatly speed up file transfers
when the destination file already exists. This remote-update
protocol allows “rsync” to transfer just the differences
between two sets of files across the network link, using
an efficient checksum-search algorithm. Some of the
additional features of “rsync” are: support for copying
links, devices, owners, groups and permissions; can use
any transparent remote shell, including “rsh” or “ssh”; can
tunnel over encrypted connections and is compatible with
Kerberized rsh/ssh authentication; and does not require
root privileges. The only problem is the available bandwidth.
Internet2 may ameliorate this problem but given
the prevalence of Napster-like programs competing with
data transfers, this is not a certainty. The other method
would be to use some form of removable and universally
readable media. Two new methods are hot pluggable IDE
disks in $90 FireWire cases, and DVD-R disks. Since
FireWire works on Linux, Windows 98SE, and Macintosh
OS9 and OSX, one can use hot pluggable EIDE disks in
FireWire cases as a simple method of transferring reasonable
amounts of data or even full sets of analysis software.
In any case, it’s best not to try to transfer any chunk of
data more than once. Local CPUs and disks are far less
expensive than wide area networks.
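
The checksum search mentioned for “rsync” relies on a weak checksum that can be slid along a file one byte at a time. The Python sketch below is written in the spirit of that rolling checksum and is not rsync's source code; the function names, the 16-bit modulus, and the block size are illustrative assumptions, and the strong hash that rsync uses to confirm a candidate match is omitted.

```python
M = 1 << 16   # both sums are kept modulo 2^16

def weak_checksum(block: bytes):
    """Return (a, b): a is the byte sum, b weights earlier bytes more heavily."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, old_byte, new_byte, block_len):
    """Slide the window one byte to the right without rescanning the block."""
    a = (a - old_byte + new_byte) % M
    b = (b - block_len * old_byte + a) % M
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
n = 8
a, b = weak_checksum(data[:n])
for k in range(1, len(data) - n + 1):
    a, b = roll(a, b, data[k - 1], data[k + n - 1], n)
    # The rolled value always agrees with a full recomputation.
    assert (a, b) == weak_checksum(data[k:k + n])
```

Because each slide costs a constant amount of work, the sender can test every byte offset of a large file against the receiver's block checksums, which is what makes transferring only the differences practical.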

Writable 4.7 GB DVD-R disks can be purchased for $5.
They can be read by $60 DVD-ROM drives and written
by the $500 Pioneer DVR-A03 drive. Linux is capable
of writing DVD-Rs. However, the software to do so is not
available under a free license. It is an enhanced version of
“cdrecord”, the free program that writes CDs, CD-Rs, and
CD-RWs. A demo version that will write up to 1 GB is
available from the author’s FTP site. An alternative,
which is free, is to use the patch for cdrecord. Using
this patched version of “cdrecord”, we have succeeded in
writing a DVD-R using the Pioneer DVR-A03 both internally
(it’s an EIDE device) and over FireWire. The specific
kernel used was Linux 2.4.18 plus the pre1 patch from
Marcelo Tosatti, the pre1-ac2 patch from Alan
Cox, and the ieee1394 tree. We used a patched
version of cdrecord 1.11a11. The image was a standard
iso9660 filesystem image created with “mkisofs”, including
a 2880 kB boot image. (The DVD itself contains a complete
copy of the February 27, 2002 snapshot of Debian
Linux’s upcoming 3.0 release, which would normally take
up six 700 MB CD-Rs.) The image took approximately 25
minutes to write at 2x speed. The long-term reliability of
DVD-R media still needs to be explored.

Footnote 3: Promise Technology’s Ultra133 TX2 PCI controller card uses a
wider 48-bit data address versus the older 28-bit address, which is
limited to 2^28 512 byte blocks or 137 Gigabytes. The card controls
four disks and has a $59 list price. (See
http://www.promise.com/marketing/datasheet/file/Ultra133tx2DS.pdf)

VI. Conclusion 

We have tested redundant arrays of IDE disk drives for
use in offline high energy physics data analysis and Monte
Carlo simulations. Parts costs of total systems using commodity
IDE disks are now at the $4000 per terabyte level,
the same cost per terabyte as Storage Technology tape silos.
The disks, however, offer much better granularity;
even small institutions can afford them. The faster access
of disk versus tape is a major added bonus. We have
tested software RAID-5 systems running under Linux 2.4
using Promise Ultra 100 disk controllers. RAID-5 provides
parity bits to protect data in case of a single catastrophic
disk failure. Tape backup is not required for data that
can be recreated with modest effort. Journaling file systems
permit rapid recovery from crashes. Our data analysis
strategy is to encapsulate data and CPU processing
power. Data is stored on many PCs. Analysis for a particular
part of a data set takes place locally on the PC where
the data resides. The network is only used to put results together.
Commodity 8-port gigabit ethernet switches combined
with a single high end, fast backplane switch would
allow one to connect a thousand PCs, each with a terabyte
of disk space. Some tasks may need more than one CPU
to go through the data even on one RAID array. For such
tasks dual CPUs and/or several boxes on one local 8-port
ethernet switch should be adequate and avoid overwhelming
the backbone switching fabric connecting an entire installation.
Again the backbone is only used to put results
together. We successfully performed simple tests of three
methods of moving data between sites: internet transfers,
hot pluggable EIDE disks in FireWire cases, and DVD-R
disks.

Current high energy physics experiments, like BaBar at
SLAC, feature relatively low data acquisition rates, only 3
MB/s, less than a third of the rates taken at Fermilab fixed
target experiments a decade ago. The Large Hadron
Collider experiments CMS and ATLAS, with data acquisition
rates starting at 100 MB/s, will be more challenging and
require physical architectures that minimize helter skelter
data movement if they are to fulfill their promise. In many
cases, architectures designed to solve particular processing
problems are far more cost effective than general solutions.
Some of the techniques explored in this
paper, to physically encapsulate data and CPUs together,
may be useful.